Bandit experiment
Lemmas
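Throughout the paper, we assume the 'stack of rewards' model from Chapter 4.6 of [60].

Since $A \mapsto A^{-1}$ is continuous on the space of invertible matrices, the result follows by the continuous mapping theorem.

Lemma 2. Consider the setup from Part II of the proof of Proposition 2. Define $\hat\theta_{n,t} = \Sigma_{n,t} \sum_{i=1}^{t-1} \mathbb{1}\{a^\star_i = a_i = a^{(j)}\}\, w_{n,i}$. Then $\hat\theta_{n,t} \to \theta^\star_n$ in probability.

Proof. The quantity $S^\star_t(j) - S_t(j) \ge 0$ is the number of '$a^{(j)}$ mistakes', and it is associated with positive regret when the inequality is strict. Observe that we must have $t^{-1}(S^\star_t(j) - S_t(j)) \to 0$ in probability. Otherwise there would be $c, \epsilon > 0$ such that $\limsup_{t\to\infty} \mathbb{P}\big(t^{-1}(S^\star_t(j) - S_t(j)) > \epsilon\big) > c$, implying
$$\limsup_{T\to\infty} T^{-1}\mathbb{E}[R^{2s}_T] \ge \limsup_{T\to\infty} T^{-1}\mathbb{E}[R^{N}_T] \ge c\,\epsilon\,\Delta > 0,$$
which contradicts the assumption $\limsup_{T\to\infty} T^{-1}\mathbb{E}[R^{2s}_T] \le 0$ (recall $\Delta = \min_i r_i > 0$). Finally, $t^{-1}(S^\star_t(j) - S_t(j)) \to 0$ implies convergence in probability of the sum term in $\hat\theta_{n,t}$. Since an analogous argument can be made for the covariance term, and $A \mapsto A^{-1}$ is continuous on the space of invertible matrices, $\hat\theta_{n,t} \to \theta^\star_n$ in probability by the continuous mapping theorem, as desired.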
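The final step of the argument combines two facts: the empirical sum and covariance terms converge, and matrix inversion is continuous on invertible matrices, so their combination converges to the target by the continuous mapping theorem. The sketch below illustrates that mechanism numerically in a hypothetical ordinary least-squares setting. It is not the estimator from the proof: the two-dimensional features, Gaussian noise, true parameter, and the helper names `inv2`, `matvec`, and `estimate_theta` are all assumptions made for illustration. It forms the normalized Gram matrix and cross term, inverts the former, and shows the estimate approaching the true parameter as the number of rounds grows.

```python
import random


def inv2(m):
    """Inverse of a 2x2 matrix [[a, b], [c, d]].

    The map m -> m^{-1} is continuous wherever det(m) != 0, which is the
    property the continuous mapping theorem needs.
    """
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(m, v):
    """Multiply a 2x2 matrix by a length-2 vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]


def estimate_theta(theta_star, t, noise_sd=1.0, seed=0):
    """Hypothetical least-squares estimate Sigma_t^{-1} b_t from t simulated rounds."""
    rng = random.Random(seed)
    sigma = [[0.0, 0.0], [0.0, 0.0]]  # running t^{-1} * sum_i x_i x_i^T
    b = [0.0, 0.0]                    # running t^{-1} * sum_i x_i y_i
    for _ in range(t):
        x = [1.0, rng.uniform(-1.0, 1.0)]  # intercept plus one random feature
        y = theta_star[0] * x[0] + theta_star[1] * x[1] + rng.gauss(0.0, noise_sd)
        for r in range(2):
            b[r] += x[r] * y / t
            for col in range(2):
                sigma[r][col] += x[r] * x[col] / t
    # Both empirical terms converge by the law of large numbers; inverting the
    # (asymptotically invertible) Gram matrix preserves that convergence.
    return matvec(inv2(sigma), b)


if __name__ == "__main__":
    theta_star = [0.5, -1.0]
    for t in (100, 10_000, 100_000):
        est = estimate_theta(theta_star, t)
        err = max(abs(e - s) for e, s in zip(est, theta_star))
        print(f"t={t:>7}  max abs error={err:.4f}")
```

The limiting Gram matrix here is $\mathrm{diag}(1, 1/3)$, which is invertible, so the continuity condition holds at the limit; if it were singular, the inversion step, and with it the argument, would break down.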